This tutorial demonstrates how to analyse audio data for acoustic phonetic studies in R. It is mainly intended to illustrate possible workflows. The topics covered are:
- emuR and EMU-SDMS, the EMU Speech Database Management System
- displaying databases and segment lists with serve()
- visualisation with ggplot2

The generated HTML page of this tutorial is available here.
This tutorial is organised as an R Markdown notebook. To execute
the code within this notebook (filename Lecture.Rmd) it has
to be opened in RStudio. When executing code, the results appear beneath
the code. In order to do so, you need to have R and RStudio installed.
When this R notebook is loaded into RStudio, you can execute chunks by
clicking the Run button within the chunk or by placing your
cursor inside it and pressing Cmd+Shift+Enter, allowing you to
experiment with the code. This handout contains all output of the code
(tables, visualisations etc.). The easiest way to work with this R code
is to clone the entire project from this GitHub repository.
The following libraries will be needed and have to be installed.
library(tidyverse)
library(emuR)
library(tools)
library(rio)
library(ggplot2)
library(magrittr)
library(htmltools)
library(joeyr)
library(knitr)
Before starting to work with the database, let’s take a quick look at how the automatic segmentation of audio files into smaller chunks like words and phonetic segments actually works.
We are using the MAUS tools from the University of Munich. Besides a web interface, MAUS segmentation can be called directly from within R through an API.
The input is an audio recording together with a written representation of the words spoken in it. A trained model of the language, here Luxembourgish, is then used to map the sounds and words derived from the text onto the speech signal. The output is an object with temporal alignments between words and sounds. This is a weak form of speech recognition.
Text for a test of forced alignment: Den ale Mann ass an dat kaalt Waasser gefall. (The old man fell into the cold water.), with the corresponding sound file saved together in the same directory.
# create an emuDB from this text + audio
convert_txtCollection(dbName = "testDb",
sourceDir = "./testDB",
targetDir = ".")
# load this db
db <- load_emuDB("testDb_emuDB")
## INFO: Checking if cache needs update for 1 sessions and 1 bundles ...
## INFO: Performing precheck and calculating checksums (== MD5 sums) for _annot.json files ...
## INFO: Nothing to update!
# run forced alignment using BAS MAUS tools in R
runBASwebservice_all(db,
transcriptionAttributeDefinitionName = "transcription",
language ="ltz-LU",
verbose = TRUE,
runMINNI = FALSE,
resume = TRUE,
patience = 3)
Show a summary of the database.
# display the overview of the structure and content
summary(db)
##
## ── Summary of emuDB ────────────────────────────────────────────────────────────
## Name: testDb
## UUID: 93d5a88c-897b-4a1b-a4b5-f771b107b7df
## Directory: /Users/peter.gilles/Documents/GitHub/Data_Science_Humanities/testDb_emuDB
## Session count: 1
## Bundle count: 1
## Annotation item count: 53
## Label count: 72
## Link count: 50
##
## ── Database configuration ──────────────────────────────────────────────────────
##
## ── SSFF track definitions ──
##
## data frame with 0 columns and 0 rows
## ── Level definitions ──
## name type nrOfAttrDefs attrDefNames
## bundle ITEM 2 bundle; transcription;
## ORT ITEM 3 ORT; KAN; KAS;
## MAU SEGMENT 1 MAU;
## MAS ITEM 1 MAS;
## ── Link definitions ──
## type superlevelName sublevelName
## ONE_TO_MANY bundle ORT
## ONE_TO_MANY ORT MAS
## ONE_TO_MANY MAS MAU
The result of the forced alignment can also be displayed in the EMU Speech Database Management System.
# open the database in a webviewer
serve(db)
The database contains the audio recordings from the Lëtzebuerger Online Dictionnaire (available here), spoken by one female speaker. The audio files have been automatically segmented with the MAUS tools. We thus have a database consisting of textual data, basically words, and the corresponding audio data. The audio data is segmented into words and phonetic segments (sounds).
This database has been created beforehand. How to create such a database is explained in the EMU-SDMS manual.
While we will be working in this tutorial with the full database (26,000 recordings), a small demo database lod_emuDB with 500 recordings has been created, which can be used for individual testing.
We start by loading the database and give an overview of its structure and content.
# load database
db = load_emuDB("/Users/peter.gilles/Documents/_Daten/LOD-emuDB/lod_emuDB")
## INFO: Checking if cache needs update for 1 sessions and 26093 bundles ...
## INFO: Performing precheck and calculating checksums (== MD5 sums) for _annot.json files ...
## INFO: Nothing to update!
# for testing purposes, use this smaller database in the Github folder
#db = load_emuDB("lod_emuDB")
# display the overview of the structure and content
summary(db)
## ── Summary of emuDB ────────────────────────────────────────────────────────────
## Name: lod
## UUID: c87793c0-012a-11e9-874b-68b599b5deb4
## Directory: /Users/peter.gilles/Documents/_Daten/LOD-emuDB/lod_emuDB
## Session count: 1
## Bundle count: 26093
## Annotation item count: 515638
## Label count: 639542
## Link count: 489545
##
## ── Database configuration ──────────────────────────────────────────────────────
##
## ── SSFF track definitions ──
##
## name columnName fileExtension
## dft dft dft
## praatFms fm praatFms
## dft2 dft dft
## ── Level definitions ──
## name type nrOfAttrDefs attrDefNames
## bundle ITEM 4 bundle; source; SAM; MAO;
## ORT ITEM 2 ORT; KAN;
## MAU SEGMENT 1 MAU;
## ── Link definitions ──
## type superlevelName sublevelName
## ONE_TO_MANY bundle ORT
## ONE_TO_MANY ORT MAU
Tracks in emuR are acoustic representations of the speech signal, here dft for the waveform (time-amplitude representation) and praatFms for the formant measurements of vowels (see below).
Levels in an emuR database stand for levels of interlinked linguistic information. bundle is the entire audio file; ORT is the orthographic representation of the audio file, segmented into its single words; MAU is the segmentation of all phonetic segments (= sounds) of all ORT segments in a bundle.
The hierarchical structure of these levels is expressed in the link definitions as ONE_TO_MANY.
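The same structural information can also be listed programmatically with emuR’s helper functions:

```r
# list the level, link and SSFF track definitions of the loaded database
list_levelDefinitions(db)
list_linkDefinitions(db)
list_ssffTrackDefinitions(db)
```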
An emuR database can be queried with a powerful query engine. The
first example is a simple query for one word, Aarbecht.
sl = query(db, query = "[ ORT == 'Aarbecht']")
sl
The result is a segment list (sl), containing various information about the items found (time, level, name, database info). The result of the query can also be displayed in the EMU Speech Database Management System.
serve(db, seglist = sl)
The GUI will open in the Viewer pane of RStudio or you
can open it in a browser (Chrome preferred).
Here we can also display the hierarchical structure for this database
item, which is accessed during queries. bundle is the
top-level, representing the entire audio file.
The ORT level contains the nodes for the individual
words in the bundle, here the two words
Aarbecht and Aarbechten. The dependent level
then is MAU (=Munich Automatic Unit)
representing the single sounds of the words in ORT.
Two aspects render emuR’s query system extremely powerful: the use of regular expressions (including negation and other extensions) and combined queries on different levels of the database.
Let’s try more complex queries:
Regular expression with the operator =~: words beginning with Aarbecht…
sl = query(db, query = "[ ORT =~ 'Aarbecht.*']")
## Warning in query_labels(emuDBhandle, attributeName = attributeName, intermResTableSuffix = intermResTableSuffix, : =~ now requires ^ if you wish to match the
## first character in a sequence i.e. 'a.*' now also matches
## 'weakness' as it contains the sequence. '^a.*'
## matches sequences that start with 'a.*'
## e.g. the word 'amongst'.
sl
Query: select the vowel \[aː\] in all words beginning with Aarbecht… Note that in the segment list the label has now changed to the vowel, and the respective start/end information now refers only to this sound \[aː\].
sl = query(db, query = "[ ORT =~ 'Aarbecht.*' ^ #MAU=='aː']")
## Warning in query_labels(emuDBhandle, attributeName = attributeName, intermResTableSuffix = intermResTableSuffix, : =~ now requires ^ if you wish to match the
## first character in a sequence i.e. 'a.*' now also matches
## 'weakness' as it contains the sequence. '^a.*'
## matches sequences that start with 'a.*'
## e.g. the word 'amongst'.
sl
Query a sequence of sounds, e.g. e followed by
k (Méck), by using the sequence operator
->.
sl = query(db, query = "[ MAU == e -> MAU == k ]")
# print only the first 100 rows
sl[1:100, ]
In the sequence e -> k, query only the vowel, using the result modifier #.
sl = query(db, query = "[ #MAU == e -> MAU == k ]")
# print only the first 100 rows
sl[1:100, ]
Query all sound items that occur at the end of a word and are \[p\].
sl <- query(db, query = "[End(ORT, MAU) == TRUE & MAU == p ]")
sl
Retrieve all words that contain five segments.
sl <- query(db, "[Num(ORT, MAU) == 5]")
sl
Using requery, the results from a previous query can be
further specified.
# requery_seq()
# query all "m" phonetic items
sl_m = query(db, "MAU == m")
# sequential requery (left shift result by 1 (== offset of -1))
# and hence retrieve all phonetic items directly preceding
# all "m" phonetic items
sl_req_n <- requery_seq(db,
seglist = sl_m,
offset = -1)
sl_req_n
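The counterpart requery_hier() moves vertically through the hierarchy instead; for example, it can retrieve for every "m" segment the word on the ORT level that contains it (a sketch against the loaded database):

```r
# hierarchical requery: get the containing ORT word for each "m" segment
sl_m_words <- requery_hier(db, seglist = sl_m, level = "ORT")
sl_m_words
```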
Sounds can be grouped into label groups:
- all long monophthongs: "iː", "uː", "aː", "oː", "ɔː", "ɛː", "eː"
- all short monophthongs: "i", "u", "ɑ", "o", "æ", "e", "ə", "ɐ"
# query all long monophthongs
sl = query(db, "MAU == longMonophthongs")
sl
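Label groups such as longMonophthongs are not built into emuR; they are stored in the database configuration. As a sketch, such a group could be defined with add_attrDefLabelGroup() (the vowel inventory shown here mirrors the list above):

```r
# define the label group "longMonophthongs" on the MAU attribute
add_attrDefLabelGroup(db,
                      levelName = "MAU",
                      attributeDefinitionName = "MAU",
                      labelGroupName = "longMonophthongs",
                      labelGroupValues = c("iː", "uː", "aː", "oː", "ɔː", "ɛː", "eː"))
```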
With such queries the user can compile a data frame from the database, which then forms the subset for further phonetic analysis. We can select e.g. all instances of certain (or all) vowels, specify the context before or after, etc.
Of course, querying for individual segments in the audio file, like words or sounds, is only possible if this information has been added to the database beforehand.
The first task is to extract the time-amplitude waveform
representation (oscillogram) for certain single sounds, e.g. some long
aː.
# query all "aː" phonetic items
#
sl = query(db, "MAU == aː")
# instead of all 7,000 vowels take only 6
sl = sl[100:105, ]
The data for the oscillogram consists of the raw audio samples, which get_trackdata extracts via the special track name MEDIAFILE_SAMPLES. The result is another R data frame.
# get the audio samples ("MEDIAFILE_SAMPLES") for these segments
a_vowels = get_trackdata(emuDBhandle = db,
seglist = sl,
ssffTrackName = "MEDIAFILE_SAMPLES",
verbose = TRUE)
## Warning in get_trackdata(emuDBhandle = db, seglist = sl, ssffTrackName = "MEDIAFILE_SAMPLES", : The emusegs/emuRsegs object passed in refers to bundles with in-homogeneous sampling rates in their audio files! Here is a list of all refered to bundles incl. their sampling rate:
## session bundle media_file sample_rate
## 1 0000 AARBECHTSPLAZ1 AARBECHTSPLAZ1.wav 48000
## 2 0000 AARBECHTSPROZESS1 AARBECHTSPROZESS1.wav 48000
## 3 0000 AARBECHTSRECHT1 AARBECHTSRECHT1.wav 44100
## md5_annot_json
## 1 a4df301bb5ca34b46a0b8438677334d9
## 2 6156d1a162658be190fa7568a00a6223
## 3 ef1565335d9bd950a71c7ff9c4d06540
##
## INFO: parsing 6 wav segments/events
The oscillograms for these 6 vowel instances can then be visualised
with ggplot2.
# plot oscillogram
ggplot(data = a_vowels) +
# define the df columns for the x and y values
aes(y = T1, x = times_rel) +
# line chart
geom_line() +
# sl_rowIdx groups all rows in a data frame belonging to the same segment
# labels contains the label of the segment
facet_wrap(~ sl_rowIdx + labels)
# query all words beginning with 'Fluch'
#
sl = query(db, query = "[ ORT =~ 'Fluch.*']")
## Warning in query_labels(emuDBhandle, attributeName = attributeName, intermResTableSuffix = intermResTableSuffix, : =~ now requires ^ if you wish to match the
## first character in a sequence i.e. 'a.*' now also matches
## 'weakness' as it contains the sequence. '^a.*'
## matches sequences that start with 'a.*'
## e.g. the word 'amongst'.
# get "f0" track data for these segments, in this case calculated on the fly
fluch_words = get_trackdata(emuDBhandle = db,
seglist = sl,
# using Michel Scheffers’ Modified Harmonic Sieve algorithm, as implemented in emuR
onTheFlyFunctionName = "mhsF0",
verbose = TRUE)
## Warning in get_trackdata(emuDBhandle = db, seglist = sl, onTheFlyFunctionName = "mhsF0", : The emusegs/emuRsegs object passed in refers to bundles with in-homogeneous sampling rates in their audio files! Here is a list of all refered to bundles incl. their sampling rate:
## session bundle media_file sample_rate
## 1 0000 FLUCH1 FLUCH1.wav 44100
## 2 0000 FLUCH2 FLUCH2.wav 44100
## 3 0000 FLUCHBILLJEE1 FLUCHBILLJEE1.wav 44100
## 4 0000 FLUCHBLAT1 FLUCHBLAT1.wav 44100
## 5 0000 FLUCHFELD1 FLUCHFELD1.wav 44100
## 6 0000 FLUCHGESELLSCHAFT1 FLUCHGESELLSCHAFT1.wav 44100
## 7 0000 FLUCHHAFEN1 FLUCHHAFEN1.wav 44100
## 8 0000 FLUCHROUTE1 FLUCHROUTE1.wav 48000
## 9 0000 FLUCHSCHAIN1 FLUCHSCHAIN1.wav 48000
## 10 0000 FLUCHT1 FLUCHT1.wav 44100
## 11 0000 FLUCHTGEFOR1 FLUCHTGEFOR1.wav 44100
## 12 0000 FLUCHTICKET1 FLUCHTICKET1.wav 44100
## md5_annot_json
## 1 4f96300f1fd30205575bd557c6464ed5
## 2 d4d6ab259206af2592edab568a1f3169
## 3 2c9d18baef54d1ed2dfe97af377dbb47
## 4 d21730527f22712aaa0e002341f3831c
## 5 3289c832ff6ff8b42d1803163d544cce
## 6 ef31e44d088bf2461cdae5284db0b1c5
## 7 44ca764332ca16eb00f72978113cd908
## 8 2d055afc598b0f4c44fa38afd7b295da
## 9 82e2fff18e70a74058e978932dea9e50
## 10 17ecb177ca527fc055c778f8708be1ae
## 11 836a096c2af4b9dd1acdfe6163e3570e
## 12 6896b5678da288e71af86bf414c67fb1
##
## INFO: applying mhsF0 to 21 segments/events
The corresponding graphs then show the fundamental frequency (f0) track for some isolated words.
# plot f0 tracks
ggplot(data = fluch_words) +
# define the df columns for the x and y values
# T1 here is the f0 value
aes(y = T1, x = times_rel) +
# line chart
geom_line() +
# sl_rowIdx groups all rows in a data frame belonging to the same segment
# labels contains the label of the segment
facet_wrap(~ sl_rowIdx + labels)
If needed, you can check the content of your segment list with the EMU WebApp by running serve(db, seglist = sl).
The study of duration of speech segments is a standard task in
acoustic phonetics. Let’s see how to solve this in R.
Thanks to the label groups which are available in the
database, we have easy access to e.g. all shortMonophthongs
and all longMonophthongs.
# query all long and short vowels
longMonophthongs = query(db, "MAU == longMonophthongs")
shortMonophthongs = query(db, "MAU == shortMonophthongs")
# bind the two together to one segment list sl
sl = rbind(longMonophthongs, shortMonophthongs)
# print the count; number of rows
nrow(longMonophthongs)
## [1] 22950
nrow(shortMonophthongs)
## [1] 104115
The data frame contains per segment a start and end column, which allows for simple calculation of the segment duration.
# calculate durations
duration_long = longMonophthongs$end - longMonophthongs$start
duration_short = shortMonophthongs$end - shortMonophthongs$start
# calculate mean and standard deviation
paste("The mean duration for long monophthongs in the database is", round(mean(duration_long)), "ms. The standard deviation is:", sd(duration_long), ".")
## [1] "The mean duration for long monophthongs in the database is 153 ms. The standard deviation is: 62.9194511796307 ."
paste("The mean duration for short monophthongs in the database is", round(mean(duration_short)), "ms. The standard deviation is:", sd(duration_short), ".")
## [1] "The mean duration for short monophthongs in the database is 106 ms. The standard deviation is: 51.422737823314 ."
Instead of using these broad categories long and short, we can break them down into the individual vowels. The dataframes just created already contain the individual vowels in the labels column and can thus be easily utilised.
head(longMonophthongs)
First, we create an aggregated dataframe with all vowel segments. This aggregated dataframe now contains 127,065 rows (= long/short vowels).
# add new columns for length, either long or short
longMonophthongs$length <- "long"
shortMonophthongs$length <- "short"
# then merge the two data frame into a single one
all_vowels <- rbind(longMonophthongs, shortMonophthongs)
# add a new column for segment duration
all_vowels$duration <- round(all_vowels$end - all_vowels$start)
# todo: keep proper sl throughout, will be needed later for requeries!
sl <- all_vowels %>% filter(!str_detect(bundle, "^[CDDEF].*"))
sl$length <- NULL
sl$duration <- NULL
Using the powerful library dplyr, we can now calculate
the means for all vowels (= labels).
duration_vowels <- all_vowels %>%
group_by(labels, length) %>%
summarise(mean = mean(duration), sd = sd(duration))
## `summarise()` has grouped output by 'labels'. You can override using the
## `.groups` argument.
duration_vowels
Finally, visualise the means in a bar chart.
# basic setting for ggplot with data and aesthetics
ggplot(data = duration_vowels, aes(x= labels, y= mean, fill = length)) +
geom_bar(stat="identity", position=position_dodge()) +
# add errorbars
geom_errorbar(aes(ymin=mean-sd, ymax=mean+sd), width=.2,
position=position_dodge(.9)) +
theme_minimal()
From here on the duration values can be utilised in further statistical analyses.
Formant frequencies are the acoustic representation of a vowel’s quality. Of these, the first two formants, F1 and F2, are crucial for a vowel’s identity. The values for the formants are already calculated in the database and will now be extracted for all the vowels in our dataframe all_vowels.
As this takes a long time, the dataframe has been precompiled, saved to file … and is reloaded here.
#saveRDS(td_vowels, file ="td_vowels.RDS")
# td_vowels <- readRDS("td_vowels.RDS")
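For reference, the elided extraction step would look roughly like this, a sketch assuming the praatFms formant track listed in the database overview (its columns T1, T2, T3 then hold F1, F2, F3):

```r
# extract the precalculated formant track for all vowel segments
td_vowels <- get_trackdata(emuDBhandle = db,
                           seglist = sl,
                           ssffTrackName = "praatFms",
                           verbose = TRUE)
```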
Due to the varying lengths of the vowels, it is necessary to normalise the length with normalize_length. This step also takes a long time and, again, the dataframe has already been created and saved.
# normalise the length
#td_vowels_norm = normalize_length(td_vowels %>% select(-duration, -length), colNames = c("T1", "T2"))
# do some cleaning
# exclude words spoken by another (male) speaker
# td_vowels_norm <- td_vowels_norm %>%
# filter(!str_detect(bundle, "^[CDDEF].*"))
# save this dataframe
#saveRDS(td_vowels_norm, file ="td_vowels_norm.RDS")
# load this data frame
td_vowels_norm <- readRDS("td_vowels_norm.RDS")
Formants are calculated for every sampling point of the audio file. Thus, the dataframe td_vowels_norm now contains some 2.1 million rows/values.
For illustration purposes, the formants for the long vowels in three example words (maachen, Viz, Kuuscht) are selected into a dataframe.
formant_examples <- td_vowels_norm %>%
filter(sl_rowIdx == "722" | sl_rowIdx == "21702" | sl_rowIdx == "11191")
These three vowels are then visualised as line charts.
ggplot(data=formant_examples) +
geom_smooth(aes(x = times_norm, y = T1, col = labels, group = labels)) +
geom_smooth(aes(x = times_norm, y = T2, col = labels, group = labels)) +
labs(x = "Duration (normalized)", y = "F1 + F2 (Hz)") +
ggtitle("Example formants F1, F2 of the three corner vowels") +
facet_wrap(~ labels)
To characterise the identity of a vowel, it is common practice to select only one point in the middle of the vowel. To do so, I am using filter to select the F1 and F2 values at the normalised time point 0.4. This gives us F1 and F2 values for 104,055 vowels.
vowel_midpoints = td_vowels_norm %>%
filter(times_norm == 0.4)
vowel_midpoints$labels <- as.factor(vowel_midpoints$labels)
Again, some further data conversion.
# convert Hertz to Bark
# convert T1, T2, T3 to Bark
td_vowels_norm$T1 <- emuR::bark(td_vowels_norm$T1)
td_vowels_norm$T2 <- emuR::bark(td_vowels_norm$T2)
# insert new columns from requeries to have the info about the next and next_next and previous label after the vowel available in the dataframe
# vowel_midpoints$next_label = as.factor(requery_seq(db, seglist = sl, offset=1)$labels)
# vowel_midpoints$next_next_label = as.factor(requery_seq(db, seglist = sl, offset=2, ignoreOutOfBounds =TRUE)$labels)
# vowel_midpoints$word = requery_hier(db, seglist = sl, level="ORT")$labels
# vowel_midpoints$previous_label <- as.factor(requery_seq(db, seglist = sl, offset= -1)$labels)
#
# saveRDS(vowel_midpoints, "vowel_midpoints.RDS")
vowel_midpoints <- readRDS("vowel_midpoints.RDS")
The dataframe vowel_midpoints is now enriched with additional columns for the actual word (of a vowel instance), previous_label, next_label and next_next_label, which allows selecting data in a more fine-grained way for the following visualisations, e.g. all vowels aː with a g before and an R after.
head(vowel_midpoints)
Prepared in this way, the visualisation of these more than 100,000 vowels can commence. Vowel quality is charted in scatter plots, where the F1 value is plotted on the y-axis and the F2 value on the x-axis. To resemble the articulatory positions of vowels, both axes are reversed. Thus i is located at the top front, u at the top back and a in the low position.
The result looks rather flashy, but nevertheless has some flaws. First, there are several outliers, especially in the SW corner; an elimination procedure for outliers is needed. Secondly, the clouds for the individual vowels overlap each other to such an extent that their proper locations are not identifiable.
# takes too long: do not run with full dataset!
# ggplot(vowel_midpoints) +
# aes(x = T2, y = T1, label = labels, col = labels) +
# geom_text() +
# scale_y_reverse() +
# scale_x_reverse(breaks = seq(from=7, to=14, by=1), minor_breaks = NULL) +
# labs(x = "F2 (Hz)", y = "F1 (Hz)") +
# ggtitle("Scatter plot of all vowels in the LOD database")
# ggsave("plot_vowel_midpoints.png")
Vowel plot of all vowels in the LOD database
Next task: remove outliers. I am using a function based on the Mahalanobis distance (https://en.wikipedia.org/wiki/Mahalanobis_distance), which incrementally excludes outliers up to a predefined threshold. In the present case we keep 85% of the values.
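The saved dataframe was produced beforehand; as a minimal dplyr sketch of the same idea, each vowel cloud can be trimmed by Mahalanobis distance to its centroid (the column names T1/T2 and the 85% threshold mirror the tutorial; the exact incremental procedure behind the saved data may differ):

```r
library(dplyr)

# keep, per vowel, the `keep` proportion of tokens closest (in
# Mahalanobis distance) to the centroid of that vowel's F1/F2 cloud
trim_outliers <- function(df, keep = 0.85) {
  df %>%
    group_by(labels) %>%
    mutate(md = mahalanobis(cbind(T1, T2),
                            center = colMeans(cbind(T1, T2)),
                            cov = cov(cbind(T1, T2)))) %>%
    filter(md <= quantile(md, keep)) %>%
    ungroup() %>%
    select(-md)
}
```

Applied to vowel_midpoints, this would keep the 85% of tokens per vowel that lie closest to that vowel’s centre in the F1/F2 plane.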
# outlier computations takes also a while
# to be on the safe side, also save this dataframe to disk
#saveRDS(without_outliers, file="without_outliers.RDS")
without_outliers <- readRDS("without_outliers.RDS")
without_outliers contains some 88,000 rows, meaning around 16,000 have been identified and removed as outliers.
To reduce the number of data points in the scatter plot, one can draw contour plots (following this idea). This method, geom_density2d, makes it easy to identify the core zones of vowel realisations, i.e. where, similar to a weather chart, the lines are closer together.
ggplot(without_outliers, aes(x = T2, y = T1, color=labels)) +
geom_density2d() +
scale_y_reverse() +
scale_x_reverse(breaks = seq(from=7, to=14, by=1), minor_breaks = NULL) +
labs(x = "F2 (Bark)", y = "F1 (Bark)") +
ggtitle("Contour plot of all vowels in the LOD database")
Instead of displaying each and every data point, one can draw only an ellipse for the extent of each vowel cloud. For each ellipse the middle point, the so-called centroid, is also computed.
# plot centroids and ellipses
centroid = without_outliers %>%
group_by(labels) %>%
summarize(F1=mean(T1), F2=mean(T2))
ggplot(without_outliers) +
aes(x=T2,y=T1, col=labels,label=labels)+
stat_ellipse() +
geom_text(data=centroid,aes(x=F2,y=F1)) +
scale_y_reverse() + scale_x_reverse() +
labs(x = "F2 (Bark)", y = "F1 (Bark)") +
theme(legend.position="none") +
ggtitle("Dispersion of vowels in Luxembourgish")
A further possibility is to use the function facet_wrap, which creates a separate plot for each level of a factor. Here, individual plots are displayed according to the labels (= vowels) in the dataframe. The elliptoid shape of each cloud is a result of the outlier removal procedure.
ggplot(without_outliers %>% filter(!(labels =="ɐ" | labels == 'ə' | labels == 'ɔː'))) +
aes(x = T2, y = T1, label = labels, col = labels) +
geom_density2d() +
#geom_text() +
scale_y_reverse() +
scale_x_reverse(breaks = seq(from=7, to=14, by=1), minor_breaks = NULL) +
labs(x = "F2 (Bark)", y = "F1 (Bark)") +
ggtitle("Contour plots of all vowels in the LOD database") +
facet_wrap(~ labels)
From the visualisations shown so far it becomes obvious that vowels are not neatly separated but often show a certain amount of overlap. This is either due to measurement errors or indicates that vowel production is changing, i.e. that we are dealing with sound change. Sound change usually manifests itself in subconscious, minimal deviations from a former prototypical sound, and these deviations may become more prominent at a later stage.
In Luxembourgish, too, several sound changes are in progress for the vowels. The last example in this tutorial covers the variational pattern of the two e-vowels, \[e\] (in words like Méck, bréngen, sécher) and \[ə\] (as in Schëff, Dësch). While for older speakers these vowels are distinct, there is an ongoing merger among younger speakers, who pronounce the vowel before certain consonants as \[ə\] instead of \[e\].
To tackle this task, we first need the relevant data, which has been
prepared in a dataframe merger.
e_merger_data <- without_outliers %>%
filter(labels %in% c("e", "ə")) %>%
filter(next_label %in% c("ɕ", "ʃ"))
saveRDS(e_merger_data, "e_merger_data.RDS")
merger <- readRDS("e_merger_data.RDS")
With around 2,100 rows this dataset is not too large for a scatter plot. The relevant information is the horizontal overlap between the clouds: the cloud for \[ɕ\] clearly overlaps with the cloud for \[ʃ\]. Here we have clear evidence of an ongoing vowel merger, i.e. the e-vowel realisations before \[ɕ\] and \[ʃ\] are becoming more similar.
ggplot(merger) +
aes(x = T2, y = T1, label = labels, col = labels) +
geom_text() +
scale_y_reverse() +
scale_x_reverse(breaks = seq(from=7, to=14, by=1), minor_breaks = NULL) +
labs(x = "F2 (Bark)", y = "F1 (Bark)") +
ggtitle("Scatter plot of the vowels [ə] and [e] in the LOD database") +
facet_wrap(~ next_label)
The actual amount of overlap between two cloud-like data sets can be calculated with the so-called Pillai score (details see here). Basically, high Pillai scores indicate distinct clouds, low Pillai scores indicate merger.
Let’s now calculate the Pillai score for the overlap. We extract the rows for \[e\] before \[ɕ\] and \[ə\] before \[ʃ\], respectively, and merge them into a temporary df ‘vowels’.
# select rows for the two sounds
e_ch_vowels <- merger %>%
filter((labels == "e" & next_label =="ɕ"))
e_sch_vowels <- merger %>%
filter((labels == "ə" & next_label =="ʃ"))
# new df with the selected rows
vowels <- rbind(e_ch_vowels, e_sch_vowels)
# calculate the Pillai score
pillai(cbind(T1, T2) ~ labels, data = vowels)
## [1] 0.2905629
Remember, the Pillai score ranges between 0 and 1. The lower the score, the closer the two groups of realisations. With this rather low value, the Pillai score implies that the two vowel realisations are indeed rather close and merging into each other.
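The pillai() function used above comes from the joeyr package; the score itself is the Pillai-Bartlett trace of a MANOVA and can also be computed with base R. A self-contained sketch on simulated F1/F2 clouds (the values are illustrative, not the LOD data):

```r
set.seed(1)
# two simulated vowel clouds in the (F1, F2) plane, in Bark
vowels_sim <- data.frame(
  labels = rep(c("e", "ə"), each = 100),
  T1 = c(rnorm(100, mean = 5.0, sd = 0.4), rnorm(100, mean = 4.6, sd = 0.4)),
  T2 = c(rnorm(100, mean = 11.5, sd = 0.5), rnorm(100, mean = 11.2, sd = 0.5))
)
# Pillai-Bartlett trace from a MANOVA of (T1, T2) on the vowel label
fit <- manova(cbind(T1, T2) ~ labels, data = vowels_sim)
pillai_score <- summary(fit, test = "Pillai")$stats[1, "Pillai"]
pillai_score
```

The closer the two simulated clouds are moved together, the closer this value falls to 0, mirroring the interpretation of the merger analysis above.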